{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "polyglot\n", "===============================" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Downloads](https://img.shields.io/pypi/dm/polyglot.svg \"Downloads\")](https://pypi.python.org/pypi/polyglot)\n", "[![Latest Version](https://badge.fury.io/py/polyglot.svg \"Latest Version\")](https://pypi.python.org/pypi/polyglot)\n", "[![Build Status](https://travis-ci.org/aboSamoor/polyglot.png?branch=master \"Build Status\")](https://travis-ci.org/aboSamoor/polyglot)\n", "[![Documentation Status](https://readthedocs.org/projects/polyglot/badge/?version=latest \"Documentation Status\")](https://readthedocs.org/builds/polyglot/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Polyglot is a natural language pipeline that supports massive multilingual applications.\n", "\n", "* Free software: GPLv3 license\n", "* Documentation: http://polyglot.readthedocs.org.\n", "\n", "### Features\n", "\n", "\n", "* Tokenization (165 Languages)\n", "* Language detection (196 Languages)\n", "* Named Entity Recognition (40 Languages)\n", "* Part of Speech Tagging (16 Languages)\n", "* Sentiment Analysis (136 Languages)\n", "* Word Embeddings (137 Languages)\n", "* Morphological analysis (135 Languages)\n", "* Transliteration (69 Languages)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Developer\n", "\n", "* Rami Al-Rfou @ `rmyeid gmail com`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Quick Tutorial" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import polyglot\n", "from polyglot.text import Text, Word" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Language Detection" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Language Detected: Code=fr, 
Name=French\n", "\n" ] } ], "source": [ "text = Text(\"Bonjour, Mesdames.\")\n", "print(\"Language Detected: Code={}, Name={}\\n\".format(text.language.code, text.language.name))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u'Beautiful', u'is', u'better', u'than', u'ugly', u'.', u'Explicit', u'is', u'better', u'than', u'implicit', u'.', u'Simple', u'is', u'better', u'than', u'complex', u'.']\n" ] } ], "source": [ "zen = Text(\"Beautiful is better than ugly. \"\n", " \"Explicit is better than implicit. \"\n", " \"Simple is better than complex.\")\n", "print(zen.words)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Sentence(\"Beautiful is better than ugly.\"), Sentence(\"Explicit is better than implicit.\"), Sentence(\"Simple is better than complex.\")]\n" ] } ], "source": [ "print(zen.sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part of Speech Tagging" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Word POS Tag\n", "------------------------------\n", "O DET\n", "primeiro ADJ\n", "uso NOUN\n", "de ADP\n", "desobediência NOUN\n", "civil ADJ\n", "em ADP\n", "massa NOUN\n", "ocorreu ADJ\n", "em ADP\n", "setembro NOUN\n", "de ADP\n", "1906 NUM\n", ". 
PUNCT\n" ] } ], "source": [ "text = Text(u\"O primeiro uso de desobediência civil em massa ocorreu em setembro de 1906.\")\n", "\n", "print(\"{:<16}{}\".format(\"Word\", \"POS Tag\")+\"\\n\"+\"-\"*30)\n", "for word, tag in text.pos_tags:\n", " print(u\"{:<16}{:>2}\".format(word, tag))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Named Entity Recognition" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[I-LOC([u'Gro\\xdfbritannien']), I-PER([u'Gandhi'])]\n" ] } ], "source": [ "text = Text(u\"In Großbritannien war Gandhi mit dem westlichen Lebensstil vertraut geworden\")\n", "print(text.entities)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Polarity" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Word Polarity\n", "------------------------------\n", "Beautiful 0\n", "is 0\n", "better 1\n", "than 0\n", "ugly -1\n", ". 
0\n" ] } ], "source": [ "print(\"{:<16}{}\".format(\"Word\", \"Polarity\")+\"\\n\"+\"-\"*30)\n", "for w in zen.words[:6]:\n", " print(\"{:<16}{:>2}\".format(w, w.polarity))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Embeddings" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Neighbors (Synonyms) of Obama\n", "------------------------------\n", "Bush \n", "Reagan \n", "Clinton \n", "Ahmadinejad \n", "Nixon \n", "Karzai \n", "McCain \n", "Biden \n", "Huckabee \n", "Lula \n", "\n", "\n", "The first 10 dimensions out of the 256 dimensions\n", "\n", "[-2.57382345 1.52175975 0.51070285 1.08678675 -0.74386948 -1.18616164\n", " 2.92784619 -0.25694436 -1.40958667 -2.39675403]\n" ] } ], "source": [ "word = Word(\"Obama\", language=\"en\")\n", "print(\"Neighbors (Synonyms) of {}\".format(word)+\"\\n\"+\"-\"*30)\n", "for w in word.neighbors:\n", " print(\"{:<16}\".format(w))\n", "print(\"\\n\\nThe first 10 dimensions out of the {} dimensions\\n\".format(word.vector.shape[0]))\n", "print(word.vector[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Morphology" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u'Pre', u'process', u'ing']\n" ] } ], "source": [ "word = Text(\"Preprocessing is an essential step.\").words[0]\n", "print(word.morphemes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transliteration" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "препрокессинг\n" ] } ], "source": [ "from polyglot.transliteration import Transliterator\n", "transliterator = Transliterator(source_lang=\"en\", target_lang=\"ru\")\n", "print(transliterator.transliterate(u\"preprocessing\"))" ] } ], "metadata": { 
"kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }